Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different customer types.
It also helps a business adapt its products to target customers drawn from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that segment.
Dataset Resource: [https://www.kaggle.com/imakash3011/customer-personality-analysis]
import pandas as pd
import numpy as np
from scipy import stats
import researchpy as rp
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor
from sklearn.decomposition import PCA
1. Data Preview and Variable Corrections
2. Distribution of Variables
    1. Distribution of Numeric Variables
    2. Distribution of Categorical Variables
3. Crossover of Variables
    1. Crossover of Categorical Variables
    2. Crossover of Numeric Variables

Data Preprocessing
1. Data Cleaning
    1. Missing Data
    2. Outlier and Noisy Data
2. Data Standardization

PCA (Principal Component Analysis)
1. Data Preview and Variable Corrections
data=pd.read_csv(r"C:\Users\mirza\OneDrive\Data Science\data\marketingss_campaign.csv", sep="\t")
data.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   2240 non-null   int64
 1   Year_Birth           2240 non-null   int64
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64
 6   Teenhome             2240 non-null   int64
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64
 9   MntWines             2240 non-null   int64
 10  MntFruits            2240 non-null   int64
 11  MntMeatProducts      2240 non-null   int64
 12  MntFishProducts      2240 non-null   int64
 13  MntSweetProducts     2240 non-null   int64
 14  MntGoldProds         2240 non-null   int64
 15  NumDealsPurchases    2240 non-null   int64
 16  NumWebPurchases      2240 non-null   int64
 17  NumCatalogPurchases  2240 non-null   int64
 18  NumStorePurchases    2240 non-null   int64
 19  NumWebVisitsMonth    2240 non-null   int64
 20  AcceptedCmp3         2240 non-null   int64
 21  AcceptedCmp4         2240 non-null   int64
 22  AcceptedCmp5         2240 non-null   int64
 23  AcceptedCmp1         2240 non-null   int64
 24  AcceptedCmp2         2240 non-null   int64
 25  Complain             2240 non-null   int64
 26  Z_CostContact        2240 non-null   int64
 27  Z_Revenue            2240 non-null   int64
 28  Response             2240 non-null   int64
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
data["Year_Birth"] = 2021 - data["Year_Birth"]  # convert birth year to age (analysis year: 2021)
data = data.rename(columns={"Year_Birth": "Customers_Age", "Response": "Last_Campaign"})
# Dt_Customer is stored as day-month-year; parse it explicitly so no dates are misread month-first
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format="%d-%m-%Y")
# Education has a natural order, so declare it as an ordered categorical
sort_cat = ["Basic", "2n Cycle", "Graduation", "Master", "PhD"]
data["Education"] = pd.Categorical(data["Education"], ordered=True, categories=sort_cat)
data["Marital_Status"] = pd.Categorical(data["Marital_Status"])
data.head()
| ID | Customers_Age | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Last_Campaign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 64 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 67 | Graduation | Single | 46344.0 | 1 | 1 | 2014-08-03 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 56 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 37 | Graduation | Together | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 40 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
We apply the same process to the variable "Marital_Status" and convert it to a categorical variable.
The reason for all of this preparation is that machine-learning algorithms can interpret properly typed variables better, which leads to better predictions.
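As a small illustration (self-contained and independent of the dataset above, with made-up values), an ordered categorical exposes integer codes that respect the declared category order, which is what distance-based algorithms can exploit:

```python
import pandas as pd

# Hypothetical education values, declared in order from lowest to highest level
edu = pd.Categorical(
    ["PhD", "Basic", "Master"],
    categories=["Basic", "2n Cycle", "Graduation", "Master", "PhD"],
    ordered=True,
)
print(list(edu.codes))  # integer codes follow the declared order, not alphabetical order
```

Comparisons such as `edu < "Graduation"` also become meaningful once `ordered=True` is set.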
2. Distribution of Variables
1. Distribution of Numeric Variables
data_num=data.select_dtypes(include=["int64", "float64"])
data_num.describe()
| ID | Customers_Age | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Last_Campaign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.000000 | 2240.000000 | 2216.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | ... | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.0 | 2240.0 | 2240.000000 |
| mean | 5592.159821 | 52.194196 | 52247.251354 | 0.444196 | 0.506250 | 49.109375 | 303.935714 | 26.302232 | 166.950000 | 37.525446 | ... | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.013393 | 0.009375 | 3.0 | 11.0 | 0.149107 |
| std | 3246.662198 | 11.984069 | 25173.076661 | 0.538398 | 0.544538 | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | ... | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.114976 | 0.096391 | 0.0 | 0.0 | 0.356274 |
| min | 0.000000 | 25.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 25% | 2828.250000 | 44.000000 | 35303.000000 | 0.000000 | 0.000000 | 24.000000 | 23.750000 | 1.000000 | 16.000000 | 3.000000 | ... | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 50% | 5458.500000 | 51.000000 | 51381.500000 | 0.000000 | 0.000000 | 49.000000 | 173.500000 | 8.000000 | 67.000000 | 12.000000 | ... | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| 75% | 8427.750000 | 62.000000 | 68522.000000 | 1.000000 | 1.000000 | 74.000000 | 504.250000 | 33.000000 | 232.000000 | 50.000000 | ... | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 |
| max | 11191.000000 | 128.000000 | 666666.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.0 | 11.0 | 1.000000 |
8 rows × 26 columns
rp.summary_cont(data_num)
| Variable | N | Mean | SD | SE | 95% Conf. | Interval | |
|---|---|---|---|---|---|---|---|
| 0 | ID | 2240.0 | 5592.1598 | 3246.6622 | 68.5983 | 5457.6370 | 5726.6827 |
| 1 | Customers_Age | 2240.0 | 52.1942 | 11.9841 | 0.2532 | 51.6976 | 52.6907 |
| 2 | Income | 2216.0 | 52247.2514 | 25173.0767 | 534.7508 | 51198.5861 | 53295.9166 |
| 3 | Kidhome | 2240.0 | 0.4442 | 0.5384 | 0.0114 | 0.4219 | 0.4665 |
| 4 | Teenhome | 2240.0 | 0.5062 | 0.5445 | 0.0115 | 0.4837 | 0.5288 |
| 5 | Recency | 2240.0 | 49.1094 | 28.9625 | 0.6119 | 47.9093 | 50.3094 |
| 6 | MntWines | 2240.0 | 303.9357 | 336.5974 | 7.1119 | 289.9891 | 317.8824 |
| 7 | MntFruits | 2240.0 | 26.3022 | 39.7734 | 0.8404 | 24.6543 | 27.9502 |
| 8 | MntMeatProducts | 2240.0 | 166.9500 | 225.7154 | 4.7691 | 157.5977 | 176.3023 |
| 9 | MntFishProducts | 2240.0 | 37.5254 | 54.6290 | 1.1542 | 35.2619 | 39.7890 |
| 10 | MntSweetProducts | 2240.0 | 27.0629 | 41.2805 | 0.8722 | 25.3525 | 28.7734 |
| 11 | MntGoldProds | 2240.0 | 44.0219 | 52.1674 | 1.1022 | 41.8604 | 46.1834 |
| 12 | NumDealsPurchases | 2240.0 | 2.3250 | 1.9322 | 0.0408 | 2.2449 | 2.4051 |
| 13 | NumWebPurchases | 2240.0 | 4.0848 | 2.7787 | 0.0587 | 3.9697 | 4.2000 |
| 14 | NumCatalogPurchases | 2240.0 | 2.6621 | 2.9231 | 0.0618 | 2.5409 | 2.7832 |
| 15 | NumStorePurchases | 2240.0 | 5.7902 | 3.2510 | 0.0687 | 5.6555 | 5.9249 |
| 16 | NumWebVisitsMonth | 2240.0 | 5.3165 | 2.4266 | 0.0513 | 5.2160 | 5.4171 |
| 17 | AcceptedCmp3 | 2240.0 | 0.0728 | 0.2598 | 0.0055 | 0.0620 | 0.0835 |
| 18 | AcceptedCmp4 | 2240.0 | 0.0746 | 0.2627 | 0.0056 | 0.0637 | 0.0854 |
| 19 | AcceptedCmp5 | 2240.0 | 0.0728 | 0.2598 | 0.0055 | 0.0620 | 0.0835 |
| 20 | AcceptedCmp1 | 2240.0 | 0.0643 | 0.2453 | 0.0052 | 0.0541 | 0.0745 |
| 21 | AcceptedCmp2 | 2240.0 | 0.0134 | 0.1150 | 0.0024 | 0.0086 | 0.0182 |
| 22 | Complain | 2240.0 | 0.0094 | 0.0964 | 0.0020 | 0.0054 | 0.0134 |
| 23 | Z_CostContact | 2240.0 | 3.0000 | 0.0000 | 0.0000 | NaN | NaN |
| 24 | Z_Revenue | 2240.0 | 11.0000 | 0.0000 | 0.0000 | NaN | NaN |
| 25 | Last_Campaign | 2240.0 | 0.1491 | 0.3563 | 0.0075 | 0.1343 | 0.1639 |
plt.figure(figsize=(20,65))
for i, col in enumerate(data_num):
plt.subplot(11,3, i+1)
sns.distplot(data_num[col],
kde=True,
bins=30,
color= "teal",
rug=True,
rug_kws={"color": "r",
"alpha":0.3,
"linewidth": 0.2,
"height":0.1 },
kde_kws={"color": "b",
"alpha": 0.2,
"linewidth": 2,
"shade": True}
);
plt.tight_layout()
2. Distribution of Categorical Variables
cvd=data.select_dtypes(include=['category'])
rp.summary_cat(cvd)
| Variable | Outcome | Count | Percent | |
|---|---|---|---|---|
| 0 | Education | Graduation | 1127 | 50.31 |
| 1 | PhD | 486 | 21.70 | |
| 2 | Master | 370 | 16.52 | |
| 3 | 2n Cycle | 203 | 9.06 | |
| 4 | Basic | 54 | 2.41 | |
| 5 | Marital_Status | Married | 864 | 38.57 |
| 6 | Together | 580 | 25.89 | |
| 7 | Single | 480 | 21.43 | |
| 8 | Divorced | 232 | 10.36 | |
| 9 | Widow | 77 | 3.44 | |
| 10 | Alone | 3 | 0.13 | |
| 11 | Absurd | 2 | 0.09 | |
| 12 | YOLO | 2 | 0.09 |
edu = pd.DataFrame(cvd["Education"].value_counts())
mar = pd.DataFrame(cvd["Marital_Status"].value_counts())
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(15, 7)
ax[0].bar(edu.index,
          edu["Education"])
ax[1].bar(mar.index,
          mar["Marital_Status"])
plt.tight_layout()
plt.tight_layout()
3. Crossover of Variables
1. Crossover of Categorical Variables
cross= data_num.iloc[:,[6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,25]]
plt.figure(figsize=(20,45))
for i, col in enumerate(cross):
plt.subplot(9,2,i+1)
sns.pointplot(x=data["Education"],
y =cross[col],
hue=data["Marital_Status"],
data=cross);
plt.tight_layout()
Now let us briefly interpret the charts for the variables that matter most to the company, one by one:
MntWines: Wine consumption appears highest among master's-level customers living alone. It is also high among university graduates with "Absurd" and "Divorced" marital status, but remarkably low among university graduates and PhD holders living alone. Overall, wine consumption rises with education level, while marital status shows no striking pattern.
Note, however, that the company has very few "Absurd" customers; any decisions about that group should keep this small sample in mind.
MntFruits: Fruit consumption does not appear to vary with education level, and marital status likewise has little effect. The "Absurd" university graduates are again high, and fruit consumption rises among "Widow" customers at the "2n Cycle" education level. Customers living alone purchase little fruit.
MntMeatProducts: Meat spending increases with education level for widowed and single customers, the opposite of the trend for the other lifestyles. Customers living alone purchase little meat.
MntFishProducts: Customers at the "2n Cycle" education level who are widowed or living together consume a lot of fish. The other bars are close to each other, so fish consumption otherwise shows little variation by education level or lifestyle. Again, customers living alone buy little fish, while the "Absurd" group buys a lot.
MntSweetProducts: Widows and customers at the "2n Cycle" education level spend heavily on sweets. Notably, sweet consumption declines as education level increases, and widows differ markedly from the other lifestyles.
MntGoldProds: Gold purchases show no clear relationship with education level or marital status. There is an increase among widows at the master's level, but probably not enough to warrant serious planning by the company. Customers with the "Absurd" lifestyle are again high.
NumDealsPurchases: Discounted purchases differ by marital status at the "Basic" education level. Most of the lines do not intersect, except at the Graduation and Master levels, which indicates that discount-usage preferences vary widely across education levels and lifestyles.
NumWebPurchases: The number of purchases on the company's website clearly increases with education level. There is little difference between marital statuses, except for customers living alone.
NumCatalogPurchases: Catalog usage shows no clear differences by education level, and the same holds for lifestyles apart from customers living alone and the "Absurd" group. Only customers at the "Basic" education level are noticeably less likely to shop through the catalog.
NumStorePurchases: Interestingly, store purchases also increase with education level. Among lifestyles, widows shop in the store the most, while store purchases by customers living alone drop at the PhD level.
NumWebVisitsMonth: Website visits stand out at the "Basic" education level, even though the previous chart showed few shoppers at that level. The other education levels and lifestyles show little variation. Customers living alone visit the website slightly more often, and university graduates slightly less often, than the others.
AcceptedCmp3: Customers living together or alone prefer Campaign 3 the most; there is no difference by education level.
AcceptedCmp4: Campaign 4 usage varies by lifestyle at the "2n Cycle" education level. It is high among "Divorced" and "Widow" customers and low among those living alone. There is little variation by education level.
AcceptedCmp5: Campaign 5 usage does not appear to change with education level or lifestyle, apart from the "Absurd" group and widows.
AcceptedCmp1: Widows use Campaign 1 the most; there is no difference by education level apart from the "Absurd" group. But it should not be forgotten that the company has very few "Absurd" customers.
AcceptedCmp2: Campaign 2 usage is high among "Divorced" customers at the "2n Cycle" education level, and it increases at the PhD level in almost every lifestyle.
Complain: Complaints over the last two years are concentrated among customers at the "2n Cycle" education level and generally decrease as education level rises.
Last_Campaign: Acceptance of the last campaign generally increases with education level, and it is high among widows. Usage by PhD-level customers living alone is also notably high. Usage rates vary considerably across lifestyles.
(sns.FacetGrid(data,hue="Marital_Status",
height=5,
xlim=(0,100))
.map(sns.kdeplot,
"Recency",
shade=True)
.add_legend());
(sns.FacetGrid(data,hue="Marital_Status",
height=5,
xlim=(0,10))
.map(sns.kdeplot,
"NumDealsPurchases",
shade=True)
.add_legend());
(sns.FacetGrid(data,hue="Marital_Status",
height=5,
xlim=(0,11))
.map(sns.kdeplot,
"NumWebPurchases",
shade=True)
.add_legend());
(sns.FacetGrid(data,hue="Marital_Status",
height=5,
xlim=(0,15))
.map(sns.kdeplot,
"NumCatalogPurchases",
shade=True)
.add_legend());
(sns.FacetGrid(data,hue="Marital_Status",
height=5,
xlim=(0,15))
.map(sns.kdeplot,
"NumWebVisitsMonth",
shade=True)
.add_legend());
2. Crossover of Numeric Variables
plt.figure(figsize=(26,26))
corr= data_num.corr()
sns.heatmap(data=corr,
annot=True,
cbar=False,
square=True,
fmt=".2%")
plt.tight_layout()
A common rule of thumb for interpreting the correlation coefficient r:
Very weak or no correlation if |r| < 20%
Weak correlation if 20%–40%
Moderate correlation if 40%–60%
High correlation if 60%–80%
Very high correlation if |r| > 80%
For example, there is a 72% correlation between "MntMeatProducts" and "NumCatalogPurchases". This means that customers who make catalog purchases also tend to buy meat products.
If, on the other hand, the correlation were −72%, it would mean the opposite: customers who make catalog purchases tend not to buy meat products.
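Rather than reading every cell off the heatmap, the strongest pairs can also be extracted programmatically. A self-contained sketch on toy data (the column names are only stand-ins for the numeric frame used above):

```python
import numpy as np
import pandas as pd

# Toy numeric data standing in for data_num; one engineered strong pair
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "MntMeatProducts": x,
    "NumCatalogPurchases": 0.8 * x + rng.normal(scale=0.6, size=200),
    "Recency": rng.normal(size=200),
})

# Keep only the upper triangle of the correlation matrix, then flatten to pairs
corr = df.corr()
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.abs().sort_values(ascending=False).head(3))
```

The `k=1` offset excludes the diagonal, so each pair appears exactly once.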
Data Preprocessing
1. Data Cleaning
1. Missing Data
data.isnull().sum()
ID                     0
Customers_Age          0
Education              0
Marital_Status         0
Income                24
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Last_Campaign          0
dtype: int64
m_values=data[data.isnull().any(axis=1)]
m_values
| ID | Customers_Age | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Last_Campaign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1994 | 38 | Graduation | Married | NaN | 1 | 0 | 2013-11-15 | 11 | 5 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 27 | 5255 | 35 | Graduation | Single | NaN | 1 | 0 | 2013-02-20 | 19 | 5 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 43 | 7281 | 62 | PhD | Single | NaN | 0 | 0 | 2013-05-11 | 80 | 81 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 48 | 7244 | 70 | Graduation | Single | NaN | 2 | 1 | 2014-01-01 | 96 | 48 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 58 | 8557 | 39 | Graduation | Single | NaN | 1 | 0 | 2013-06-17 | 57 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 71 | 10629 | 48 | 2n Cycle | Married | NaN | 1 | 0 | 2012-09-14 | 25 | 25 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 90 | 8996 | 64 | PhD | Married | NaN | 2 | 1 | 2012-11-19 | 4 | 230 | ... | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 91 | 9235 | 64 | Graduation | Single | NaN | 1 | 1 | 2014-05-27 | 45 | 7 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 92 | 5798 | 48 | Master | Together | NaN | 0 | 0 | 2013-11-23 | 87 | 445 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 128 | 8268 | 60 | PhD | Married | NaN | 0 | 1 | 2013-11-07 | 23 | 352 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 133 | 1295 | 58 | Graduation | Married | NaN | 0 | 1 | 2013-11-08 | 96 | 231 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 312 | 2437 | 32 | Graduation | Married | NaN | 0 | 0 | 2013-03-06 | 69 | 861 | ... | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 319 | 2863 | 51 | Graduation | Single | NaN | 1 | 2 | 2013-08-23 | 67 | 738 | ... | 7 | 0 | 1 | 0 | 1 | 0 | 0 | 3 | 11 | 0 |
| 1379 | 10475 | 51 | Master | Together | NaN | 0 | 1 | 2013-01-04 | 39 | 187 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1382 | 2902 | 63 | Graduation | Together | NaN | 1 | 1 | 2012-03-09 | 87 | 19 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1383 | 4345 | 57 | 2n Cycle | Single | NaN | 1 | 1 | 2014-12-01 | 49 | 5 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1386 | 3769 | 49 | PhD | Together | NaN | 1 | 0 | 2014-02-03 | 17 | 25 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2059 | 7187 | 52 | Master | Together | NaN | 1 | 1 | 2013-05-18 | 52 | 375 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2061 | 1612 | 40 | PhD | Single | NaN | 1 | 0 | 2013-05-31 | 82 | 23 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2078 | 5079 | 50 | Graduation | Married | NaN | 1 | 1 | 2013-03-03 | 82 | 71 | ... | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2079 | 10339 | 67 | Master | Together | NaN | 0 | 1 | 2013-06-23 | 83 | 161 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2081 | 3117 | 66 | Graduation | Single | NaN | 0 | 1 | 2013-10-18 | 95 | 264 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2084 | 5250 | 78 | Master | Widow | NaN | 0 | 0 | 2013-10-30 | 75 | 532 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 11 | 1 |
| 2228 | 8720 | 43 | 2n Cycle | Together | NaN | 0 | 0 | 2012-12-08 | 53 | 32 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
24 rows × 29 columns
null_col=m_values.iloc[:,2:4]
null_col.value_counts()
Education Marital_Status
Graduation Single 6
Married 4
Master Together 4
PhD Married 2
Single 2
2n Cycle Married 1
Single 1
Together 1
Graduation Together 1
Master Widow 1
PhD Together 1
dtype: int64
g_m= data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Married")]
g_s= data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Single")]
p_s= data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Single")]
n2_m= data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Married")]
p_m= data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Married")]
m_t= data.loc[(data["Education"] == "Master") & (data["Marital_Status"] == "Together")]
g_t= data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Together")]
n2_s= data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Single")]
p_t= data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Together")]
m_w= data.loc[(data["Education"] == "Master") & (data["Marital_Status"] == "Widow")]
n2_t= data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Together")]
g_m["Income"]=g_m["Income"].fillna(g_m["Income"].mean())
g_s["Income"]=g_s["Income"].fillna(g_s["Income"].mean())
p_s["Income"]=p_s["Income"].fillna(p_s["Income"].mean())
n2_m["Income"]=n2_m["Income"].fillna(n2_m["Income"].mean())
p_m["Income"]=p_m["Income"].fillna(p_m["Income"].mean())
m_t["Income"]=m_t["Income"].fillna(m_t["Income"].mean())
g_t["Income"]=g_t["Income"].fillna(g_t["Income"].mean())
n2_s["Income"]=n2_s["Income"].fillna(n2_s["Income"].mean())
p_t["Income"]=p_t["Income"].fillna(p_t["Income"].mean())
m_w["Income"]=m_w["Income"].fillna(m_w["Income"].mean())
n2_t["Income"]=n2_t["Income"].fillna(n2_t["Income"].mean())
data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Married")]=g_m
data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Single")]=g_s
data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Single")]=p_s
data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Married")]=n2_m
data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Married")]=p_m
data.loc[(data["Education"] == "Master") & (data["Marital_Status"] == "Together")]=m_t
data.loc[(data["Education"] == "Graduation") & (data["Marital_Status"] == "Together")]=g_t
data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Single")]=n2_s
data.loc[(data["Education"] == "PhD") & (data["Marital_Status"] == "Together")]=p_t
data.loc[(data["Education"] == "Master") & (data["Marital_Status"] == "Widow")]=m_w
data.loc[(data["Education"] == "2n Cycle") & (data["Marital_Status"] == "Together")]=n2_t
By imputing each missing "Income" with the mean income of customers who share the same education level and marital status, we replace the "NaN" values with realistic estimates, which helps produce better clusters later on.
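The per-group fills above can be condensed into a single group-wise imputation with `groupby` + `transform`. A minimal sketch with made-up values, illustrating the same idea:

```python
import pandas as pd

# Toy frame with one missing Income; values are illustrative only
df = pd.DataFrame({
    "Education": ["Graduation", "Graduation", "PhD", "Graduation"],
    "Marital_Status": ["Married", "Married", "Single", "Married"],
    "Income": [40000.0, 60000.0, 55000.0, None],
})

# Fill each NaN with the mean Income of its (Education, Marital_Status) group
df["Income"] = (
    df.groupby(["Education", "Marital_Status"])["Income"]
      .transform(lambda s: s.fillna(s.mean()))
)
print(df["Income"].tolist())
```

`transform` returns a result aligned with the original index, so the fill happens in place for every group at once instead of requiring one slice per combination.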
2. Outlier and Noisy Data
local=data.drop(["Dt_Customer"],
axis=1)
local=pd.get_dummies(local,
columns=['Marital_Status', "Education"])
clf = LocalOutlierFactor(n_neighbors=20,
                         contamination=0.1)
clf.fit_predict(local)                      # fit and label observations: -1 outlier, 1 inlier
loc_scores = clf.negative_outlier_factor_   # the more negative the score, the more anomalous
np.sort(loc_scores)[0:50]
array([-56.19805941, -7.10274346, -6.98616886, -6.74873272,
-6.73106366, -6.72169496, -6.70667403, -6.46774986,
-3.31150141, -2.13857395, -1.86039541, -1.75062273,
-1.67663123, -1.45638163, -1.39865009, -1.36287587,
-1.35221979, -1.33729058, -1.33451043, -1.3091405 ,
-1.30648389, -1.29005723, -1.27554315, -1.26606338,
-1.26529081, -1.26424352, -1.26104893, -1.26101868,
-1.24187776, -1.23875497, -1.2320278 , -1.22132512,
-1.20488908, -1.20283968, -1.19829423, -1.19604059,
-1.19595727, -1.19416756, -1.17410376, -1.17391938,
-1.17369019, -1.16948542, -1.1691941 , -1.1656196 ,
-1.16168247, -1.16127158, -1.1608729 , -1.15849023,
-1.15783501, -1.15750502])
thre_val = np.sort(loc_scores)[10]          # 11th-smallest score as threshold
outlier_data = data[loc_scores < thre_val]  # the 10 most extreme observations
new_data = data[loc_scores >= thre_val]     # keep everything else, including the threshold row
outlier_data
| ID | Customers_Age | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Last_Campaign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 8475 | 48 | PhD | Married | 157243.0 | 0 | 1 | 2014-01-03 | 98 | 20 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 617 | 1503 | 45 | PhD | Together | 162397.0 | 1 | 1 | 2013-03-06 | 31 | 85 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 646 | 4611 | 51 | Graduation | Together | 105471.0 | 0 | 0 | 2013-01-21 | 36 | 1009 | ... | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 3 | 11 | 1 |
| 655 | 5555 | 46 | Graduation | Divorced | 153924.0 | 0 | 0 | 2014-07-02 | 81 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 687 | 1501 | 39 | PhD | Married | 160803.0 | 0 | 0 | 2012-04-08 | 21 | 55 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1300 | 5336 | 50 | Master | Together | 157733.0 | 1 | 0 | 2013-04-06 | 37 | 39 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1653 | 4931 | 44 | Graduation | Together | 157146.0 | 0 | 0 | 2013-04-29 | 13 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 1898 | 4619 | 76 | PhD | Single | 113734.0 | 0 | 0 | 2014-05-28 | 9 | 6 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2132 | 11181 | 72 | PhD | Married | 156924.0 | 0 | 0 | 2013-08-29 | 85 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2233 | 9432 | 44 | Graduation | Together | 666666.0 | 1 | 0 | 2013-02-06 | 23 | 9 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
10 rows × 29 columns
new_data.head()
| ID | Customers_Age | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Last_Campaign | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 64 | Graduation | Single | 58138.0 | 0 | 0 | 2012-04-09 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 67 | Graduation | Single | 46344.0 | 1 | 1 | 2014-08-03 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 56 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 37 | Graduation | Together | 26646.0 | 1 | 0 | 2014-10-02 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 40 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
new_data_num=new_data.select_dtypes(include=["int64",
"float64"])
plt.figure(figsize=(20,65))
for i, col in enumerate(new_data_num):
X= new_data_num[col].values
X= X.reshape(-1,1)
    model= KMeans(n_clusters=200)  # many 1-D clusters; isolated points stand out as noise
model.fit(X)
yhat = model.predict(X)
plt.subplot(9,3,
i+1)
plt.xlabel(new_data_num.columns[i])
plt.scatter(X,
yhat,
s=15,
c='black');
plt.tight_layout()
new_data["Customers_Age"] = new_data["Customers_Age"].clip(upper=89).astype(int)  # cap implausible ages (e.g. 128)
new_data_num=new_data.select_dtypes(include=["int64",
"float64",
"int32"])
plt.figure(figsize=(20,55))
for i, col in enumerate(new_data_num):
plt.subplot(11,3,
i+1)
sns.violinplot(new_data_num[col]);
plt.tight_layout()
new_clus_data=pd.get_dummies(new_data,
columns=['Marital_Status', "Education"])
standart_data=new_clus_data.drop(["Dt_Customer"],
axis=1)
standart_data = preprocessing.scale(standart_data)
standart_data
array([[-0.02020566, 1.00643103, 0.32036554, ..., 0.99329302,
-0.44540666, -0.52456804],
[-1.05214421, 1.26128786, -0.25462967, ..., 0.99329302,
-0.44540666, -0.52456804],
[-0.44622686, 0.32681282, 0.97731488, ..., 0.99329302,
-0.44540666, -0.52456804],
...,
[ 0.51763455, -1.03242361, 0.26395809, ..., 0.99329302,
-0.44540666, -0.52456804],
[ 0.81489446, 1.09138331, 0.86186731, ..., -1.00675227,
2.24513928, -0.52456804],
[ 1.17530285, 1.26128786, 0.06348494, ..., -1.00675227,
-0.44540666, 1.90633041]])
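`preprocessing.scale` standardizes each column to zero mean and unit variance, so no single large-scale variable (such as Income) dominates the distance calculations. A quick self-contained check on toy numbers:

```python
import numpy as np
from sklearn import preprocessing

# Two toy columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Xs = preprocessing.scale(X)

# After scaling, every column has mean 0 and standard deviation 1
print(np.allclose(Xs.mean(axis=0), 0), np.allclose(Xs.std(axis=0), 1))
```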
pca = PCA(n_components=0.95,
random_state=0)
standart_data= pca.fit_transform(standart_data)
total_var = pca.explained_variance_ratio_.sum() * 100
print("Total Variance: %",total_var)
Total Variance: % 95.63252392024631
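Passing a float to `n_components` tells PCA to keep just enough components to explain that fraction of the variance. A small sketch of the same mechanism on synthetic data (three informative dimensions plus near-duplicates of each):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 3 independent signals, each paired with a slightly noisy copy
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(300, 3))])

# Keep the smallest number of components explaining >= 95% of the variance
pca = PCA(n_components=0.95, random_state=0)
Xp = pca.fit_transform(X)
print(Xp.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```

Because each copy is almost perfectly correlated with its original, three components suffice, and the 6-column input collapses to 3.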
X= standart_data
wcss = []
for i in range(1, 11):
km = KMeans(n_clusters = i,
random_state = 0)
km.fit(X)
wcss.append(km.inertia_)
fig = px.line( x= range(1, 11),
y=wcss, markers=True)
fig.update_layout(width = 990,
height = 600,
showlegend = False,
title='The Elbow Method',
title_x = 0.5,
xaxis_title='Number of Clusters',
font = dict(family = "TimesNewRoman",
color="black",
size = 15),
yaxis_title='WCSS')
fig.add_annotation(dict(x=0.26, y=0.47, xref="paper",
yref="paper",
text='The Elbow Point',
font_size = 20,
showarrow=True,
arrowhead=4,
arrowsize=2,
arrowwidth=1,
ax=50,ay=-70))
fig.show()
km = KMeans(n_clusters=3,
            random_state=0)
comp = km.fit_transform(X)  # sample-to-centroid distances, used for the 3-D plot below
labels = km.labels_
new_data["Clusters"] = labels
re_clust = {
0: "SEGMENT II",
1: "SEGMENT III",
2: "SEGMENT I",
}
new_data["Clusters"] = new_data["Clusters"].map(re_clust)
new_data["Clusters"].value_counts()
SEGMENT I      1012
SEGMENT II      656
SEGMENT III     561
Name: Clusters, dtype: int64
I decided to use the KMeans clustering algorithm because it is fast, cheap, and convenient.
Resources: [https://en.wikipedia.org/wiki/K-means_clustering]
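The elbow point above is read by eye; the silhouette score gives a quick numeric cross-check of the choice k=3. A sketch on synthetic blobs (not the campaign data), where the true number of clusters is known:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters standing in for the
# PCA-reduced matrix X; the k with the highest silhouette score
# should agree with the elbow.
centers = np.array([[0, 0], [8, 8], [-8, 8]])
Xb, _ = make_blobs(n_samples=600, centers=centers,
                   cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):
    lbl = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(Xb)
    scores[k] = silhouette_score(Xb, lbl)

best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this synthetic data
```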
You may ask how the cluster names are determined. Whichever cluster has the most customers is named "SEGMENT I", i.e. SEGMENT I > SEGMENT II > SEGMENT III by customer count. I haven't tied a specific profile to each cluster yet; in the conclusion we will work out their characteristics by reviewing the tables. For now, these names are purely descriptive.
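Since the naming rule is "biggest cluster first", the `re_clust` mapping need not be hard-coded: it can be derived from `value_counts()`, so the names survive a re-run where KMeans assigns different label ids. A sketch with a hypothetical label array:

```python
import pandas as pd

# Hypothetical KMeans labels; rank them by cluster size so the
# largest cluster is always "SEGMENT I", regardless of label ids.
raw_labels = pd.Series([2, 2, 2, 2, 0, 0, 0, 1, 1])

names = ["SEGMENT I", "SEGMENT II", "SEGMENT III"]
order = raw_labels.value_counts().index      # label ids, largest cluster first
re_clust = {lab: names[i] for i, lab in enumerate(order)}
print(re_clust)  # {2: 'SEGMENT I', 0: 'SEGMENT II', 1: 'SEGMENT III'}
```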
fig = px.scatter_3d(
comp,
x=0,
y=1,
z=2,
color=new_data['Clusters'],
labels={'0': 'SEGMENT II',
'1': 'SEGMENT III',
'2': 'SEGMENT I'})
fig.update_traces(hovertemplate = 'SEGMENT II: %{x} <br>SEGMENT III: %{y} <br>SEGMENT I: %{z}',
marker=dict(size=2.5,
line=dict(width=2)))
fig.update_layout(width = 900, height = 900,
showlegend = False,
scene = dict(xaxis = dict(title = 'SEGMENT II',
titlefont_color = 'black'),
yaxis = dict(title = 'SEGMENT III',
titlefont_color = 'black'),
zaxis = dict(title = 'SEGMENT I',
titlefont_color = 'black')),
                  font = dict(family = "Times New Roman",
color="black", size = 13),
title_text = 'CUSTOMERS CLUSTERS BY SEGMENTS',
title_x = 0.5,
legend = dict(font = dict(size = 15,
color = "black"))
)
fig.show()
Actually, this part is as important as the clustering itself. When you present the segment features to managers or to the marketing and operations departments, they must be able to grasp them quickly: they were not involved in the process and have no prior knowledge of the data. So we must visualize the segment features well and make them easy to understand.
For that reason I will not add explanations under the graphs; they should be understandable on their own.
fig = px.pie(new_data["Clusters"].value_counts().reset_index(),
values = 'Clusters',
names = 'index',
width = 700,
height = 700)
fig.update_traces(textposition = 'inside',
textinfo = 'percent + label + value' ,
hole = 0.8,
marker = dict(colors = ['#dd4124','#009473', '#336b87'],
line = dict(color = 'white',
width = 2)),
hovertemplate = 'Clients: %{value}')
fig.update_layout(annotations = [dict(text = 'Number of clients <br>by cluster. Total: 2229',
x = 0.5,
y = 0.5,
font_size = 28,
showarrow = False,
                                      font_family = 'Times New Roman',
font_color = 'black')],
showlegend = False)
fig.show()
segment1=new_data[new_data["Clusters"]=="SEGMENT I"].reset_index(drop=True)
segment2=new_data[new_data["Clusters"]=="SEGMENT II"].reset_index(drop=True)
segment3=new_data[new_data["Clusters"]=="SEGMENT III"].reset_index(drop=True)
colors= ["#626EFA", "#EF553B", "#00CC96", "#AB63FA"]
y1=[segment1["Customers_Age"][( segment1["Customers_Age"]<= 18)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]<= 18)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]<= 18)].shape[0]]
y2=[segment1["Customers_Age"][( segment1["Customers_Age"]>= 19) & (segment1["Customers_Age"]<=35)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>= 19) & (segment2["Customers_Age"]<=35)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>= 19) & (segment3["Customers_Age"]<=35)].shape[0]]
y3=[segment1["Customers_Age"][( segment1["Customers_Age"]>= 36) & (segment1["Customers_Age"]<=60)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>= 36) & (segment2["Customers_Age"]<=60)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>= 36) & (segment3["Customers_Age"]<=60)].shape[0]]
y4=[segment1["Customers_Age"][( segment1["Customers_Age"]>=61)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>=61)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>=61)].shape[0]]
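The four y-lists above repeat the same boolean masks for each segment; `pd.cut` with fixed bin edges expresses the age bands once and counts them in one call. A sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages; pd.cut assigns each age to one of the four
# bands used above (<=18, 19-35, 36-60, 61+), replacing the
# repeated boolean-mask counts.
ages = pd.Series([17, 25, 30, 44, 59, 61, 70])
bands = pd.cut(ages,
               bins=[0, 18, 35, 60, 200],
               labels=["<=18", "19-35", "36-60", "61+"])
counts = bands.value_counts().reindex(["<=18", "19-35", "36-60", "61+"])
print(counts.tolist())  # [1, 2, 2, 2]
```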
trace1= go.Bar(name='Customers 18 and under',
y=y1,
text=y1,
xaxis='x2',
yaxis='y2')
trace2= go.Bar(name='Customers between 19 and 35',
y=y2,
text=y2,
xaxis='x2',
yaxis='y2')
trace3= go.Bar(name= "Customers between 36 and 60",
y=y3,
text=y3,
xaxis='x2',
yaxis='y2')
trace4= go.Bar(name= "Customers 61 and over",
text=y4,
y=y4,
xaxis='x2',
yaxis='y2')
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "xy"},
{"type": "domain"}],
[{"type": "domain"},
{"type": "domain"}]]
)
fig.add_traces([trace1,
trace2,
trace3,
trace4],
rows=1,
cols=1)
fig.add_trace(go.Pie(values=[segment1["Customers_Age"][( segment1["Customers_Age"]<= 18)].shape[0],
segment1["Customers_Age"][( segment1["Customers_Age"]>= 19) & (segment1["Customers_Age"]<=35)].shape[0],
segment1["Customers_Age"][( segment1["Customers_Age"]>= 36) & (segment1["Customers_Age"]<=60)].shape[0],
segment1["Customers_Age"][( segment1["Customers_Age"]>=61)].shape[0]],
name="SEGMENT I",
title="SEGMENT I",
titleposition="bottom center",
textinfo="percent",
textfont_size=14,
hole = 0.5,
showlegend=False,
marker_colors= colors),
row=1,
col=2)
fig.add_trace(go.Pie(values=[segment2["Customers_Age"][( segment2["Customers_Age"]<= 18)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>= 19) & (segment2["Customers_Age"]<=35)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>= 36) & (segment2["Customers_Age"]<=60)].shape[0],
segment2["Customers_Age"][( segment2["Customers_Age"]>=61)].shape[0]],
name="SEGMENT II",
title="SEGMENT II",
titleposition="bottom center",
textinfo="percent",
textfont_size=14,
hole = 0.5,
showlegend=False,
marker_colors= colors),
row=2,
col=1)
fig.add_trace(go.Pie(values=[segment3["Customers_Age"][( segment3["Customers_Age"]<= 18)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>= 19) & (segment3["Customers_Age"]<=35)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>= 36) & (segment3["Customers_Age"]<=60)].shape[0],
segment3["Customers_Age"][( segment3["Customers_Age"]>=61)].shape[0]],
name="SEGMENT III",
title="SEGMENT III",
titleposition="bottom center",
textinfo="percent",
textfont_size=14,
hole = 0.5,
showlegend=False,
marker_colors= colors),
row=2,
col=2)
fig.add_annotation(dict(x=0.16,
                        y=0.21,
                        xref="paper",
                        yref="paper",
                        text='TOTAL : 548',
                        showarrow=False))
fig.add_annotation(dict(x=0.847,
                        y=0.82,
                        xref="paper",
                        yref="paper",
                        text='TOTAL : 1012',
                        showarrow=False))
fig.add_annotation(dict(x=0.84,
                        y=0.21,
                        xref="paper",
                        yref="paper",
                        text='TOTAL : 638',
                        showarrow=False))
fig.add_annotation(dict(x=0.5, y=0.5, xref="paper",
yref="paper",
text='MAXIMUM AGE = 89',
font_size = 20,
showarrow=False))
fig.add_annotation(dict(x=0.495, y=0.475, xref="paper",
yref="paper",
text='MINIMUM AGE = 25',
font_size = 20,
showarrow=False))
fig.update_traces(textposition="outside")
fig.update_layout(height=1000,
width=1000,
title_text="DISPERSION OF CUSTOMERS AGE ACCORDING TO SEGMENTS",
                  font_family = 'Times New Roman',
font_size = 15,
font_color= "black",
title_x = 0.5
)
fig.show()
segment_1 = segment1["Income"].values
segment_2 = segment2["Income"].values
segment_3= segment3["Income"].values
group_labels = ['SEGMENT I',
'SEGMENT II',
"SEGMENT III"]
fig = ff.create_distplot([segment_1,
segment_2,
segment_3],
group_labels,
bin_size=1000,
curve_type='normal')
fig.update_layout(height=700,
width=950,
title_text="DISPERSION OF CUSTOMERS INCOME ACCORDING TO SEGMENTS",
                  font_family = 'Times New Roman',
font_size = 15,
title_x = 0.5,
font_color= "black"
)
fig.show()
segment_stat1= pd.concat([segment1["Marital_Status"],
segment1["Kidhome"],
segment1["Teenhome"]],
axis=1)
segment_stat2=pd.concat([segment2["Marital_Status"],
segment2["Kidhome"],
segment2["Teenhome"]],
axis=1)
segment_stat3=pd.concat([segment3["Marital_Status"],
segment3["Kidhome"],
segment3["Teenhome"]],
axis=1)
segment1_table= pd.DataFrame(segment_stat1.value_counts())
segment2_table= pd.DataFrame(segment_stat2.value_counts())
segment3_table= pd.DataFrame(segment_stat3.value_counts())
segment1_table.reset_index(inplace=True)
segment2_table.reset_index(inplace=True)
segment3_table.reset_index(inplace=True)
values=["Marital Status", "Number of kids in household", "Number of teens in household", "Number of Customers"]
fig = make_subplots(rows=3,
cols=1,
specs=[[{"type": "table"}],
[{"type": "table"}],
[{"type": "table"}]])
fig.add_trace(go.Table(
header=dict(values=list(values),
fill_color='rgb(139, 224, 164)',
align=['left',
'center',
'center',
'center'],
font=dict(color='black',
size=14),),
cells=dict(values=[segment1_table["Marital_Status"],
segment1_table["Kidhome"],
segment1_table["Teenhome"],
segment1_table[0]],
fill_color='rgb(179, 205, 207)',
font_size=12,
align=["left",
'center',
'center',
'center'])),
row=1,
col=1)
fig.add_trace(go.Table(
header=dict(values=list(values),
fill_color='rgb(139, 224, 164)',
align=['left','center', 'center', 'center'],
font=dict(color='black',
size=14),),
cells=dict(values=[segment2_table["Marital_Status"],
segment2_table["Kidhome"],
segment2_table["Teenhome"],
segment2_table[0]],
fill_color='rgb(179, 205, 207)',
font_size=12,
align=["left",
'center',
'center',
'center'])),
row=2,
col=1)
fig.add_trace (go.Table(
header=dict(values=list(values),
fill_color='rgb(139, 224, 164)',
align=['left',
'center',
'center',
'center'],
font=dict(color='black',
size=14),),
cells=dict(values=[segment3_table["Marital_Status"],
segment3_table["Kidhome"],
segment3_table["Teenhome"],
segment3_table[0]],
fill_color='rgb(179, 205, 207)',
font_size=12,
align=["left",
'center',
'center',
'center'])),
row=3,
col=1)
fig.update_layout(height=1100,
width=950,
title_text="NUMBER OF KID/TEEN IN HOUSEHOLD ACCORDING TO CUSTOMERS MARITAL STATUS",
                  font_family = 'Times New Roman',
font_size = 15,
font_color= "black"
)
fig.add_annotation(dict(x=0.001, y=1.03, xref="paper",
yref="paper",
text='SEGMENT I',
font_size = 20,
showarrow=False))
fig.add_annotation(dict(x=0.001, y=0.65, xref="paper",
yref="paper",
text='SEGMENT II',
font_size = 20,
showarrow=False))
fig.add_annotation(dict(x=0.001, y=0.27, xref="paper",
yref="paper",
text='SEGMENT III',
font_size = 20,
showarrow=False))
fig.show()
fig = make_subplots(rows=3,
cols=1,
specs=[[{"type": "domain"}],
[{"type": "domain"}],
[{"type": "domain"}]])
colors= ["#626EFA",
"#EF553B",
"#00CC96",
"#AB63FA",
"#FFA15A",
"#19D3F3"]
labels=["Amount spent on Wine in last 2 years",
"Amount spent on Fruits in last 2 years",
"Amount spent on Meat Products in last 2 years",
"Amount spent on Fish Products in last 2 years",
"Amount spent on Sweet Products in last 2 years",
"Amount spent on Gold Products in last 2 years"
]
fig.add_trace(go.Pie(labels=labels,
values=[segment1["MntWines"].sum(),
segment1["MntFruits"].sum(),
segment1["MntMeatProducts"].sum(),
segment1["MntFishProducts"].sum(),
segment1["MntSweetProducts"].sum(),
segment1["MntGoldProds"].sum()],
name="SEGMENT I",
title="SEGMENT I",
titleposition="bottom center",
text=["WINE", "FRUITS", "MEAT", "FISH", "SWEET", "GOLD"],
textinfo="percent + text",
textfont_size=14,
hole = 0.3,
marker_colors= colors),
row=1,
col=1)
fig.add_trace(go.Pie(labels=labels,
values=[segment2["MntWines"].sum(),
segment2["MntFruits"].sum(),
segment2["MntMeatProducts"].sum(),
segment2["MntFishProducts"].sum(),
segment2["MntSweetProducts"].sum(),
segment2["MntGoldProds"].sum()],
name="SEGMENT II",
title="SEGMENT II",
titleposition="bottom center",
text=["WINE", "FRUITS", "MEAT", "FISH", "SWEET", "GOLD"],
textinfo="percent + text",
rotation=90,
textfont_size=14,
hole = 0.3,
marker_colors= colors),
row=2,
col=1)
fig.add_trace(go.Pie(labels=labels,
values=[segment3["MntWines"].sum(),
segment3["MntFruits"].sum(),
segment3["MntMeatProducts"].sum(),
segment3["MntFishProducts"].sum(),
segment3["MntSweetProducts"].sum(),
segment3["MntGoldProds"].sum()],
name="SEGMENT III",
title="SEGMENT III",
titleposition="bottom center",
text=["WINE", "FRUITS", "MEAT", "FISH", "SWEET", "GOLD"],
rotation=90,
textinfo="percent + text",
textfont_size=14,
hole = 0.3,
marker_colors= colors),
row=3,
col=1)
fig.update_layout(height=1300,
width=1000,
title_text="PERCENTAGE CUSTOMERS SPENT ON PRODUCTS LAST 2 YEAR ACCORDING TO SEGMENTS",
                  font_family = 'Times New Roman',
font_size = 15,
font_color= "black",
)
fig.show()
y_complain=[segment1["Complain"].value_counts()[1],
segment2["Complain"].value_counts()[1],
segment3["Complain"].value_counts()[1]]
y_recency=[segment1["Recency"].value_counts(),
segment2["Recency"].value_counts(),
segment3["Recency"].value_counts()]
segment_1 = segment1["Recency"].value_counts()
segment_2 = segment2["Recency"].value_counts()
segment_3= segment3["Recency"].value_counts()
y_web=[segment1["NumWebPurchases"].sum(),
segment2["NumWebPurchases"].sum(),
segment3["NumWebPurchases"].sum()]
y_catalog=[segment1["NumCatalogPurchases"].sum(),
segment2["NumCatalogPurchases"].sum(),
segment3["NumCatalogPurchases"].sum()]
y_store= [segment1["NumStorePurchases"].sum(),
segment2["NumStorePurchases"].sum(),
segment3["NumStorePurchases"].sum()]
y_webvisit=[segment1["NumWebVisitsMonth"].sum(),
segment2["NumWebVisitsMonth"].sum(),
segment3["NumWebVisitsMonth"].sum()]
labels= ["SEGMENT I", "SEGMENT II", "SEGMENT III"]
fig = make_subplots(rows=2,
cols=2,
specs=[[{"colspan":2},
None],
[{},{}]],
subplot_titles=("PURCHASING AREAS OF CUSTOMERS BY SEGMENTS",
"Number of Web Visits per Month",
"Number of Customers Complaining in last 2 years"))
fig.add_trace(go.Bar( y= labels,
x= y_web,
text= y_web,
orientation="h",
name= "Number of Web Purchases"),
row=1,
col=1)
fig.add_trace(go.Bar( y= labels,
x= y_catalog,
text=y_catalog,
orientation="h",
name= "Number of Catalog Purchases"),
row=1,
col=1)
fig.add_trace(go.Bar( y= labels,
x= y_store,
text=y_store,
orientation="h",
name= "Number of Store Purchases"),
row=1,
col=1)
fig.add_trace(go.Bar( x= labels,
y= y_webvisit,
text= y_webvisit,
name= "Number of Web Visits per Month"),
row=2,
col=1)
fig.add_trace(go.Bar(x=labels,
y=y_complain,
text=y_complain,
name = "Number of Customers Complaining in last 2 years"),
row=2,
col=2)
fig.update_traces(textposition="inside")
fig.update_layout(height=1000,
width=900,
                  font_family = 'Times New Roman',
font_size = 15,
font_color= "black",
barmode='relative',
legend=dict(
orientation="h",
yanchor="auto",
y=0.5,
xanchor="auto",
x=0.8)
)
fig.show()
labels= ["Campaign 1",
"Campaign 2",
"Campaign 3",
"Campaign 4",
"Campaign 5",
"Last Campaign"]
y1=[segment1["AcceptedCmp1"].sum(),
segment1["AcceptedCmp2"].sum(),
segment1["AcceptedCmp3"].sum(),
segment1["AcceptedCmp4"].sum(),
segment1["AcceptedCmp5"].sum(),
segment1["Last_Campaign"].sum(),
segment1["NumDealsPurchases"].sum()]
y2=[segment2["AcceptedCmp1"].sum(),
segment2["AcceptedCmp2"].sum(),
segment2["AcceptedCmp3"].sum(),
segment2["AcceptedCmp4"].sum(),
segment2["AcceptedCmp5"].sum(),
segment2["Last_Campaign"].sum(),
segment2["NumDealsPurchases"].sum()]
y3=[segment3["AcceptedCmp1"].sum(),
segment3["AcceptedCmp2"].sum(),
segment3["AcceptedCmp3"].sum(),
segment3["AcceptedCmp4"].sum(),
segment3["AcceptedCmp5"].sum(),
segment3["Last_Campaign"].sum(),
segment3["NumDealsPurchases"].sum()]
fig = make_subplots(rows=2,
                    cols=1,
                    specs=[[{"type": "polar"}],
                           [{"type": "polar"}]])
fig.add_trace(go.Scatterpolar(
r=y1,
theta=labels,
fill='toself',
name='SEGMENT I'),
row=1,
col=1)
fig.add_trace(go.Scatterpolar(
r=y2,
theta=labels,
fill='tonext',
name='SEGMENT II'),
row=1,
col=1)
fig.add_trace(go.Scatterpolar(
r=y3,
theta=labels,
fill='tonext',
name='SEGMENT III'),
row=1,
col=1)
fig.add_trace(go.Scatterpolar(
r=[segment1["NumDealsPurchases"].sum(),
segment2["NumDealsPurchases"].sum(),
segment3["NumDealsPurchases"].sum()],
theta=["SEGMENT I",
"SEGMENT II",
"SEGMENT III"],
fill='toself',
name="Number of Deals Purchases",),
row=2,
col=1)
fig.update_layout(
title_text="CAMPAIGN USAGE BY SEGMENTS",
title_x = 0.5,
    font_family = 'Times New Roman',
font_size = 15,
font_color= "black",
height=1200,
width=990,
polar=dict(
radialaxis=dict(
visible=True,
)),
showlegend=True
)
fig.show()
segment_1 = segment1["Recency"].values
segment_2 = segment2["Recency"].values
segment_3 = segment3["Recency"].values
group_labels = ['SEGMENT I',
'SEGMENT II',
"SEGMENT III"]
fig = ff.create_distplot([segment_1,
segment_2,
segment_3],
group_labels,
bin_size=.7,
curve_type='normal')
fig.update_layout(height=850,
width=990,
title_text="NUMBER OF DAYS SINCE CUSTOMER'S LAST PURCHASE",
                  font_family = 'Times New Roman',
font_size = 15,
font_color= "black",
title_x = 0.5
)
fig.show()